FMS startup

The FMS starts in completely passive mode, waiting for connections from FMAs.
The first FMA to connect to the FMS will be given the expected network map
and instructed to verify the entire network.

This first FMA to join will also be designated as the "gateway" FMA
through which Myrinet messages may be sent to other FMAs.

Additional nodes will likely join during the network verification process.
Any node with higher "gateway affinity" than the current gateway (for
example, a node that also hosts the FMS, or a designated gateway) becomes
the new gateway.
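
A minimal sketch of the gateway hand-off described above, in Python.  The
numeric affinity scale, the relative ranking of the two criteria, and the
names Fma and GatewayTracker are assumptions made for illustration; the notes
only say that a node with higher affinity replaces the current gateway.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical affinity levels; the notes name the criteria but not a scale.
AFFINITY_DEFAULT = 0
AFFINITY_DESIGNATED = 1   # node configured as a designated gateway
AFFINITY_COLOCATED = 2    # the "same node" case from the text


@dataclass
class Fma:
    hostname: str
    affinity: int = AFFINITY_DEFAULT


class GatewayTracker:
    """Tracks which connected FMA currently acts as the gateway."""

    def __init__(self) -> None:
        self.gateway: Optional[Fma] = None

    def on_connect(self, fma: Fma) -> bool:
        """Register a newly connected FMA; return True if it became gateway."""
        # The first FMA to connect always becomes the gateway.  Later
        # arrivals take over only with strictly higher affinity.
        if self.gateway is None or fma.affinity > self.gateway.affinity:
            self.gateway = fma
            return True
        return False
```

Using a strict comparison means a later node with equal affinity does not
displace the current gateway, which avoids needless hand-offs while nodes are
still joining.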

Once the verification is complete, routes will be sent to all FMAs and
the verification task will be divided up among the current node collection.
A node need not have an FMA running to have routes to it, and the gateway
FMA can be used to remotely add routes to nodes which do not have FMAs running.

The route table for single routes for 2200 nodes is 2200*(5+3+2)*2200 =
48,400,000 bytes, or about 46MB.
There should not be any trouble sending this via IP to all nodes.
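
The arithmetic above can be checked with a few lines of Python.  The function
name and the reading of (5+3+2) as bytes per route are assumptions; the notes
do not say what the three terms stand for.

```python
def route_table_bytes(nodes: int, bytes_per_route: int = 5 + 3 + 2) -> int:
    """Total bytes when every node holds one route to every node."""
    per_node_table = nodes * bytes_per_route
    return per_node_table * nodes


total = route_table_bytes(2200)
print(total)                     # 48400000
print(round(total / 2**20, 1))   # 46.2
```

Under this reading each node's own table is 22,000 bytes, and the 46MB figure
is the sum over all 2200 nodes.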


-------------

Error reporting/handling strategy

When an error occurs, an error string is generated and placed
on top of an array of error strings.  If an error is fatal,
the program exits at the lowest point in the call chain that knows
the error is fatal.  Exiting is accomplished by a common exit routine
so that 1) any cleanup can be done and 2) there is an easy place
to breakpoint if more diagnostics are needed.

If a function returns an error which is deemed non-fatal, resetting the
error index must be part of the handling.
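
The scheme above might look roughly like this in Python.  Every name here
(ErrorStack, fatal_exit, read_optional_file, load_settings) is hypothetical
and is not taken from the FMS sources; the point is only to show the push,
the reset on non-fatal handling, and the single exit routine.

```python
import sys


class ErrorStack:
    """Array of error strings; the most recent error sits on top."""

    def __init__(self):
        self.messages = []

    def push(self, message):
        self.messages.append(message)

    def reset(self):
        # Resetting the error index: discard everything recorded so far.
        self.messages.clear()


def fatal_exit(stack, status=1):
    """Common exit routine: one place for cleanup and for a breakpoint."""
    for message in reversed(stack.messages):   # newest first
        print(message, file=sys.stderr)
    sys.exit(status)


def read_optional_file(stack, path):
    """Return the file contents, or None after recording an error string."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as exc:
        stack.push(f"cannot read {path}: {exc}")
        return None


def load_settings(stack, path):
    text = read_optional_file(stack, path)
    if text is None:
        # This caller deems a missing file non-fatal, so handling the
        # error includes resetting the stack.
        stack.reset()
        return ""
    return text
```

A caller that does consider the failure fatal would call fatal_exit(stack)
instead of resetting, at the lowest level that can make that decision.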

libfma cooperates with this scheme by calling the routine specified by
lf_set_error_handler() with a descriptive string every time an error
is encountered.  If lf_set_error_handler() has never been called, the error
messages are written to stdout.
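
The handler behaviour can be illustrated with a small Python stand-in.  This
is not libfma's API: set_error_handler and report_error are hypothetical
names, and the real lf_set_error_handler() signature is not given in these
notes.

```python
from typing import Callable, Optional

# Hypothetical module-level handler slot; None means "never registered".
_handler: Optional[Callable[[str], None]] = None


def set_error_handler(handler: Callable[[str], None]) -> None:
    """Register the routine that receives every error string."""
    global _handler
    _handler = handler


def report_error(message: str) -> None:
    """Called by the library each time it encounters an error."""
    if _handler is None:
        print(message)       # default from the notes: write to stdout
    else:
        _handler(message)
```

An application using the error-string array described earlier would register
a handler that pushes each message onto that array.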

libfma also provides some utility macros which may make system calls that
fail.  They rely on a macro called "LF_ERROR" which takes an error string
as an argument and "does the right thing" with it.  See fms/fms.h for
an example of how to do this.


-----------------

Alerts and events


Alerts are persistent notices which become instantiated when certain
events occur.  The FMS maintains the master list of all alerts and clients
may request a list of outstanding alerts or react to events.

Alerts have a few different states:
	ACTIVE - the initial state of each alert
	ACKED - an alert for which the alert condition still exists, but
		has been acknowledged by a user
	RELIC - an alert whose cancellation condition has occurred, but has
		not yet been acknowledged.

Once an ACKED alert is cancelled, or a RELIC alert is ACKed, the alert ceases
to exist and is removed from the system.  Not all alerts need an ACK to go
from RELIC to destruction; those that do are marked as NEED_ACK.  Some
alerts do not need to be cancelled and are instead destroyed simply by being
ACKed, with no other event needed.  These are marked with ACK_CANCEL.

Alert examples:

The HOST_NO_INITIAL_FMA alert is destroyed as soon as the FMA from that host
connects to the FMS.  No ACK is needed.

The LINK_DOWN alert occurs when a link goes down, but the alert persists
even after the link is restored until the alert is acknowledged.  The
alert enters the RELIC state to indicate the problem no longer exists,
but ACK is required to make sure someone knows the problem once existed.
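
The alert life cycle and the two examples above can be sketched as a small
state machine in Python.  The DESTROYED value is an addition of this sketch
to represent removal from the system, and the class layout is an assumption;
only the three states and the NEED_ACK and ACK_CANCEL markers come from the
notes.

```python
from enum import Enum, Flag, auto


class State(Enum):
    ACTIVE = auto()
    ACKED = auto()
    RELIC = auto()
    DESTROYED = auto()   # not a state in the notes; marks removal


class Flags(Flag):
    NONE = 0
    NEED_ACK = auto()     # cancellation leaves a relic that must be ACKed
    ACK_CANCEL = auto()   # an ACK alone destroys the alert


class Alert:
    def __init__(self, kind, flags=Flags.NONE):
        self.kind = kind
        self.flags = flags
        self.state = State.ACTIVE

    def ack(self):
        if self.state is State.RELIC:
            self.state = State.DESTROYED
        elif self.state is State.ACTIVE:
            if Flags.ACK_CANCEL in self.flags:
                self.state = State.DESTROYED
            else:
                self.state = State.ACKED
        return self.state

    def cancel(self):
        if self.state is State.ACKED:
            self.state = State.DESTROYED
        elif self.state is State.ACTIVE:
            if Flags.NEED_ACK in self.flags:
                self.state = State.RELIC
            else:
                self.state = State.DESTROYED
        return self.state
```

With this sketch, Alert("LINK_DOWN", Flags.NEED_ACK) becomes a RELIC when
cancelled and is destroyed by the following ACK, while
Alert("HOST_NO_INITIAL_FMA") is destroyed by cancellation alone.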



Since many common occurrences may cause a flood of events which all quickly
resolve and become "relics," the interface provides a mechanism to "ACK
all relics" which allows a user to clean up the event list with only one
action without mistakenly acknowledging events which are obscured by the
large number of relics.


Internally to the FMS, alerts related to a given object are chained off of that
object.  When state changes on an object, such as a host or a linecard port,
that alert chain is followed for any appropriate alerts to cancel.  If a
reliced but un-ACKed alert of the same type exists for that object, it is
NOT reused and a new alert is created, leaving the relic to be ACKed.
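
A rough Python sketch of the per-object alert chain, including the "ACK all
relics" action.  FabricObject, its method names, and the string-valued states
are assumptions for illustration, and every alert here is treated as if it
were marked NEED_ACK.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    kind: str
    state: str = "ACTIVE"   # "ACTIVE", "ACKED", or "RELIC"


@dataclass
class FabricObject:
    name: str
    alerts: List[Alert] = field(default_factory=list)   # the alert chain

    def raise_alert(self, kind: str) -> Alert:
        # An existing relic of the same kind is deliberately NOT reused;
        # it stays on the chain until someone ACKs it.
        alert = Alert(kind)
        self.alerts.append(alert)
        return alert

    def cancel_alerts(self, kind: str) -> None:
        """Follow the chain on a state change and cancel matching alerts."""
        survivors = []
        for alert in self.alerts:
            if alert.kind == kind and alert.state == "ACKED":
                continue                 # ACKED then cancelled: removed
            if alert.kind == kind and alert.state == "ACTIVE":
                alert.state = "RELIC"    # assumes the alert is NEED_ACK
            survivors.append(alert)
        self.alerts = survivors

    def ack_all_relics(self) -> int:
        """Remove every relic in one action; return how many were removed."""
        before = len(self.alerts)
        self.alerts = [a for a in self.alerts if a.state != "RELIC"]
        return before - len(self.alerts)
```

If a link goes down, comes back, and goes down again, the chain holds one
RELIC and one ACTIVE alert of the same kind, and "ACK all relics" removes
only the first.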


----------------

Database upkeep

When FMS and FMAs run, they detect differences between what is in the
database and what is found in reality.

These differences can include:

New host
Missing host
Changed host
 - hostname

New NIC
Missing NIC
Changed NIC
 - serial number / MAC address / # links

Missing Switch
Changed Switch

Missing linecard
New linecard
Changed linecard
 - different serial #
 - different card type

Missing link
New link
(changed links will be a combo of missing and new)


New, missing, or changed NICs are not host changes.

Make all changes to the live database, recording any changes that are made
in lists.  When a change is "committed", all we need to do is remove the 
corresponding record from the change list and mark the database as needing to 
be flushed.  Changes committed must be all-or-none - you cannot commit only a
serial number change without the corresponding MAC addr change, for example.
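
One way to picture the change list is the following Python sketch.  The names
Change and FabricDb and the record layout are hypothetical.  All-or-none
commits fall out of keeping every field that changed together in a single
change record.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    target: str    # key of the affected record (hypothetical naming)
    fields: dict   # field name -> (old value, new value)


@dataclass
class FabricDb:
    records: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)   # uncommitted changes
    needs_flush: bool = False

    def apply(self, target, **new_values):
        """Change the live database now and remember what changed."""
        record = self.records.setdefault(target, {})
        delta = {key: (record.get(key), value)
                 for key, value in new_values.items()
                 if record.get(key) != value}
        if not delta:
            return None
        record.update(new_values)
        change = Change(target, delta)
        self.pending.append(change)
        return change

    def commit(self, change):
        # The live data is already correct, so committing only drops the
        # change record (all of its fields at once) and marks the database
        # as needing a flush.
        self.pending.remove(change)
        self.needs_flush = True
```

Because a serial number and a MAC address that change together land in the
same Change, there is no way to commit one without the other.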

The absence of NICs and linecards can be explicitly detected; not so with hosts.
A host that does not respond or does not contact the FMS is flagged via an
alert, but is not removed from the fabric.  If a new or other host reports
itself connected to the fabric in the same place as the original host, the
link to the original host is marked as "missing" and the link to the new
host is marked as "new".  Thus, if two hosts get swapped, you will (for now)
get two "new" links and two "missing" links.
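
The host-swap case can be worked through with a short Python sketch.  The
port-to-host dictionaries and the function name diff_links are assumptions;
only hosts that actually report in appear in the observed map, so a silent
host produces no link changes here (it would only raise an alert).

```python
def diff_links(expected, observed):
    """Compare port -> host maps and return (missing, new) link lists."""
    missing, new = [], []
    for port, current in sorted(observed.items()):
        original = expected.get(port)
        if original == current:
            continue
        if original is not None:
            # Another host now sits where the original host was expected.
            missing.append((port, original))
        new.append((port, current))
    return missing, new


expected = {"sw0:p1": "hostA", "sw0:p2": "hostB"}
swapped = {"sw0:p1": "hostB", "sw0:p2": "hostA"}
missing, new = diff_links(expected, swapped)
print(len(missing), len(new))   # 2 2
```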

NICs are determined to be absent when an FMA reports in without an expected NIC.
The NIC is removed from the current fabric map and a missing NIC notation is  
made. 


When committing a change to the fabric, downstream commits are also implied:
  commit new host -> commit all attached new NICs

Scenarios:

replace a NIC - FMA reports in with same NIC count, same NIC slots, but
 MAC and serial are different.  NIC will get marked as "changed", links will
 remain intact.  If number of ports changed, forget all link information.
 This auto-commits.

new NIC - a NIC is reported where no NIC existed before.  This may also cause
 an existing NIC to show up with a different nic_id.  Change the NIC whose
 nic_id changed, and create a new entry for the new NIC.  Keep link info
 for the pre-existing NIC.  If the old NIC changed slots, that change is
 auto-committed; the new NIC is not auto-committed.

missing NIC - if a NIC is removed from a host, its links are clobbered and the
 NIC struct is deleted from the fabric.  The nic_id of remaining NICs may
 change as a result.  The nic_id changes are auto-committed; the missing NIC
 is not.  An alert is raised for a missing NIC.
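
The three scenarios can be summarised as a classification step, sketched
below in Python.  This is a simplification under stated assumptions: NICs are
matched by serial number only, an unchanged set of nic_ids stands in for
"same NIC count, same NIC slots", and the function name and the "renumbered"
label are invented for this example.

```python
def reconcile_nics(known, reported):
    """Classify NIC differences for one host.

    known and reported map nic_id -> serial number.  Returns a list of
    (kind, serial, auto_commit) tuples.
    """
    changes = []
    if known.keys() == reported.keys():
        # Same count and same ids: a differing serial is a replaced NIC.
        for nic_id, serial in reported.items():
            if known[nic_id] != serial:
                changes.append(("changed", serial, True))
        return changes

    known_ids = {serial: nic_id for nic_id, serial in known.items()}
    reported_ids = {serial: nic_id for nic_id, serial in reported.items()}
    for serial, nic_id in reported_ids.items():
        if serial not in known_ids:
            changes.append(("new", serial, False))          # needs a commit
        elif known_ids[serial] != nic_id:
            changes.append(("renumbered", serial, True))    # auto-committed
    for serial in known_ids:
        if serial not in reported_ids:
            changes.append(("missing", serial, False))      # raises an alert
    return changes
```

The third tuple element mirrors the auto-commit rules in the scenarios:
replacements and nic_id changes commit on their own, while new and missing
NICs wait for an explicit commit.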
